Dataset: Medical Information Mart for Intensive Care-IV (MIMIC-IV)
Privacy-preserving large language models for structured medical information retrieval
NLP Tasks: Information Extraction
Method: an open-source pipeline using the locally deployed large language model (LLM) Llama 2 for structured information extraction (a code sketch follows the metrics below)
Metrics:
- Sensitivity (100%), Specificity (96%)
- Sensitivity (Ascites: 95%)
- Specificity (Ascites: 95%)
- Sensitivity (Confusion: 76%)
- Specificity (Confusion: 94%)
- Sensitivity (Abdominal pain: 84%)
- Specificity (Abdominal pain: 97%)
- Sensitivity (Shortness of breath: 87%)
- Specificity (Shortness of breath: 97%)
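A minimal sketch of such a local extraction step, assuming the llama-cpp-python bindings, a locally stored Llama 2 GGUF file, and an illustrative symptom list and prompt (not the paper's exact pipeline):

```python
# Sketch: local structured extraction with a llama.cpp-hosted Llama 2 model.
# Model path, symptom list, and prompt wording are illustrative assumptions.
import json
from llama_cpp import Llama

SYMPTOMS = ["ascites", "confusion", "abdominal pain", "shortness of breath"]

# Weights stay on the local machine, so no patient text leaves the network.
llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf", n_ctx=4096)

def extract_symptoms(note: str) -> dict:
    prompt = (
        "Read the discharge note and respond with a JSON object mapping each of "
        f"{SYMPTOMS} to true or false.\n\nNote:\n{note}\n\nJSON:"
    )
    out = llm(prompt, max_tokens=128, temperature=0.0)["choices"][0]["text"]
    # Tolerate chatter around the JSON object in the completion.
    return json.loads(out[out.index("{"): out.rindex("}") + 1])
```

The per-symptom true/false outputs can then be compared against chart review to obtain sensitivity and specificity figures like those above.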
Racial, ethnic, and sex bias in large language model opioid recommendations for pain management
NLP Tasks: Text Classification, Question Answering, Information Extraction
Method: instructing large language models (LLMs), specifically GPT-4 and Gemini, to provide subjective pain ratings and comprehensive pain management recommendations (an audit sketch follows the metrics below)
Metrics:
- Pain rating severity (OR: 0.57, 95% CI: [0.54, 0.60], P < 0.001)
- Strong opioid recommendation (OR: 2.05, 95% CI: [1.59, 2.66], P < 0.001)
- Timing of opioid recommendation (OR: 1.41, 95% CI: [1.22, 1.62], P < 0.001)
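How such odds ratios can be estimated is sketched below under stated assumptions: demographic attributes are varied in otherwise identical prompts, the parsed yes/no recommendations are collected, and a logistic regression yields the OR. The data here are synthetic placeholders, not the study's.

```python
# Sketch: bias audit via logistic regression (synthetic placeholder data;
# in the real setup the outcome is parsed from LLM responses to prompts
# that differ only in the stated race, ethnicity, or sex).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
group = rng.integers(0, 2, n)                        # 0/1 demographic indicator
strong_opioid = rng.binomial(1, 0.3 + 0.1 * group)   # parsed recommendation per response

fit = sm.Logit(strong_opioid, sm.add_constant(group.astype(float))).fit(disp=0)
lo, hi = np.exp(fit.conf_int()[1])                   # 95% CI for the odds ratio
print(f"OR={np.exp(fit.params[1]):.2f}, 95% CI=[{lo:.2f}, {hi:.2f}], p={fit.pvalues[1]:.3g}")
```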
Automation of Trainable Datasets Generation for Medical-Specific Language Model: Using MIMIC-IV Discharge Notes
NLP Tasks: Text Generation
Method: a novel approach for automatically generating instruction datasets from MIMIC-IV discharge records for fine-tuning medical-specialized language models (a scoring sketch follows the metrics below)
Metrics:
- Mean ROUGE (0.185)
- Validity rate by GPT-3.5 (88.0%)
- Validity rate by human annotator (88.5%)
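A mean-ROUGE figure of this kind can be computed for any set of generated/reference pairs with the rouge-score package; the pairs and the choice of ROUGE-L below are illustrative, not the paper's exact setup.

```python
# Sketch: mean ROUGE-L F1 over (generated, reference) response pairs.
from rouge_score import rouge_scorer

pairs = [  # placeholder (hypothesis, reference) tuples, not MIMIC-IV text
    ("patient discharged on day three in stable condition",
     "discharged stable on hospital day 3"),
    ("started metoprolol for rate control",
     "metoprolol initiated for rate control"),
]
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = [scorer.score(ref, hyp)["rougeL"].fmeasure for hyp, ref in pairs]
print(f"mean ROUGE-L F1: {sum(scores) / len(scores):.3f}")
```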
LCD Benchmark: Long Clinical Document Benchmark on Mortality Prediction for Language Models
NLP Tasks: Text Classification, Information Extraction, Question Answering
Method: the Long Clinical Document (LCD) benchmark for predicting 30-day out-of-hospital mortality from discharge notes (a labeling sketch follows the metrics below)
Metrics:
- Accuracy (over 99%)
- Accuracy (NAME: 10.2%, NAME+SYN: 36.1% with typos, NAME+SYN: 61.8% with typo-specific fine-tuning)
- Accuracy (NAME: 11.2%, NAME+SYN: 92.7% for unseen synonyms)
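A sketch of how a 30-day out-of-hospital mortality label can be derived, assuming the standard MIMIC-IV `admissions` and `patients` tables; the benchmark's exact cohort filters may differ.

```python
# Sketch: 30-day out-of-hospital mortality label from MIMIC-IV tables.
import pandas as pd

adm = pd.read_csv("admissions.csv", parse_dates=["dischtime", "deathtime"])
pat = pd.read_csv("patients.csv", parse_dates=["dod"])

df = adm.merge(pat[["subject_id", "dod"]], on="subject_id")
died_in_hospital = df["deathtime"].notna()           # exclude in-hospital deaths
days_to_death = (df["dod"] - df["dischtime"]).dt.days
df["mortality_30d"] = (~died_in_hospital) & days_to_death.between(0, 30)
```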
Quantitative Evaluation of Large Language Models to Streamline Radiology Report Impressions: A Multimodal Retrospective Analysis
NLP Tasks: Text Generation, Text Classification, Question Answering
Method: comparative analysis of four publicly available large language models (LLMs) for streamlining radiology report impressions (a comparison sketch follows below)
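A model-agnostic sketch of the comparison setup: each model receives the same condensation prompt so its impression can be scored against the radiologist's. The callables below are trivial stand-ins for real API clients or local pipelines.

```python
# Sketch: shared-prompt harness for comparing impression generators.
PROMPT = "Summarize the following radiology findings as a concise impression:\n\n{findings}"

def compare(models: dict, findings: str) -> dict:
    # Run every model on the same prompt so outputs are directly comparable.
    return {name: gen(PROMPT.format(findings=findings)) for name, gen in models.items()}

# Trivial stand-ins for real LLM calls:
models = {"model_a": lambda p: p.splitlines()[-1], "model_b": lambda p: p[:80]}
print(compare(models, "Mild cardiomegaly. No focal consolidation.\nNo pleural effusion."))
```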
AnnoDash, a clinical terminology annotation dashboard
NLP Tasks: Information Extraction, Text Classification, Named Entity Recognition
Method: AnnoDash, a flexible dashboard to support annotation of concepts with terms from a given ontology (a ranking sketch follows below)
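The core operation such a dashboard supports is ranking ontology concepts against a free-text term; the sketch below uses difflib string similarity and a three-entry toy ontology as stand-ins for AnnoDash's actual retrieval and terminology.

```python
# Sketch: rank candidate ontology concepts for a free-text term.
from difflib import SequenceMatcher

ONTOLOGY = {  # hypothetical concept_id -> preferred label
    "C0003962": "ascites",
    "C0000737": "abdominal pain",
    "C0013404": "dyspnea",
}

def rank(term: str, top_k: int = 3):
    # Score each label by string similarity to the query term.
    sim = lambda label: SequenceMatcher(None, term.lower(), label.lower()).ratio()
    return sorted(ONTOLOGY.items(), key=lambda kv: sim(kv[1]), reverse=True)[:top_k]

print(rank("abd pain"))  # ranks "abdominal pain" first
```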
Evaluation and mitigation of the limitations of large language models in clinical decision-making
NLP Tasks: Information Extraction, Text Classification, Question Answering
Method: a framework that simulates a realistic clinical setting, built on a curated dataset derived from the Medical Information Mart for Intensive Care (MIMIC) database (a simulation sketch follows below)
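A sketch of the simulation idea: the model sees only the presenting complaint, may request additional information turn by turn, and must commit to a diagnosis. `ask_llm` and the record are placeholders for a real model call and a MIMIC-derived case.

```python
# Sketch: stepwise clinical decision-making loop with placeholder components.
record = {  # hypothetical MIMIC-derived case
    "history": "45M with epigastric pain radiating to the back",
    "labs": "lipase 1200 U/L",
    "imaging": "CT: peripancreatic fat stranding",
}

def ask_llm(transcript: str) -> str:
    # Placeholder policy standing in for an LLM: request each unseen item, then diagnose.
    for item in record:
        if item not in transcript:
            return f"REQUEST {item}"
    return "DIAGNOSIS acute pancreatitis"

transcript = "presenting complaint: abdominal pain"
for _ in range(5):  # cap the number of turns
    action = ask_llm(transcript)
    if action.startswith("DIAGNOSIS"):
        print(action)
        break
    requested = action.split()[1]
    transcript += f"\n{requested}: {record[requested]}"  # reveal the requested item
```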